In [1]:
!ls
'5. spark including pre-processing csv file to load into spark.ipynb'
'6. visualizing with lime using sklearn.ensemble.RandomForestClassifier no stop word removal.ipynb'
'7. visualizing with lime using sklearn.ensemble.RandomForestClassifier stop word removal.ipynb'
 nocommas_small_descr_clm_code.csv
 small_descr_clm_code.csv
 spark-warehouse
 work
In [ ]:
# might need to install lime in the docker or system I am using
#!pip install lime
In [5]:
#!pip install nltk
Collecting nltk
  Downloading https://files.pythonhosted.org/packages/6f/ed/9c755d357d33bc1931e157f537721efb5b88d2c583fe593cc09603076cc3/nltk-3.4.zip (1.4MB)
    100% |████████████████████████████████| 1.4MB 229kB/s
Requirement already satisfied: six in /opt/conda/lib/python3.6/site-packages (from nltk)
Collecting singledispatch (from nltk)
  Downloading https://files.pythonhosted.org/packages/c5/10/369f50bcd4621b263927b0a1519987a04383d4a98fb10438042ad410cf88/singledispatch-3.4.0.3-py2.py3-none-any.whl
Building wheels for collected packages: nltk
  Running setup.py bdist_wheel for nltk ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/4b/c8/24/b2343664bcceb7147efeb21c0b23703a05b23fcfeaceaa2a1e
Successfully built nltk
Installing collected packages: singledispatch, nltk
Successfully installed nltk-3.4 singledispatch-3.4.0.3
You are using pip version 9.0.3, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
In [1]:
import pandas as pd
import numpy as np
import scipy.stats as scs
import statsmodels.api as sm
import matplotlib.pyplot as plt
import lime
import sklearn
import sklearn.ensemble
import sklearn.metrics
from sklearn import feature_extraction
from __future__ import print_function
import nltk  
from sklearn.datasets import load_files  
nltk.download('stopwords')  
from nltk.corpus import stopwords  

%matplotlib inline
%config InlineBackend.figure_format='retina'
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
In [2]:
df = pd.read_csv('small_descr_clm_code.csv')
df.drop('Unnamed: 0',axis=1, inplace=True)
df.head()
Out[2]:
descr clm code
0 CROSS-REFERENCE TO RELATED APPLICATIONS \n ... 1. A computer-implemented method of designing ... 706
1 RELATED APPLICATIONS \n This application i... What is claimed is: \n \n 1 . A sy... 705
2 CROSS REFERENCE TO RELATED APPLICATION \n ... 1. A weather information display device compri... 706
3 TECHNICAL FIELD \n The present disclosure ... 1 . A method of obtaining a user's measure... 705
4 CROSS-REFERENCE TO RELATED APPLICATIONS \n ... 1 . A method for providing borrower foreclosur... 705
In [3]:
def remove_string(dataframe,column_list,string_in_quotes):
    '''
    Input:
            dataframe: name of pandas dataframe
            column_list: list of column name strings (ex. ['col_1','col_2'])
            string_in_quotes: string to remove in quotes (ex. ',')
    
    Output:
            none
            modifies pandas dataframe to remove string.
                
    Example:
            remove_string(df, ['col_1','col_2'], ',')
    
    Warning:
            If memory issues occur, limit to one column at a time.
        
    '''
    for i in column_list:
        dataframe[i] = dataframe[i].str.replace(string_in_quotes,"").astype(str)
In [4]:
remove_string(df, ['descr'],',')
In [5]:
remove_string(df, ['clm'],',')
In [6]:
remove_string(df, ['descr'],'\n')
In [7]:
remove_string(df, ['clm'],'\n')
In [8]:
df.iloc[0]['clm']
Out[8]:
'1. A computer-implemented method of designing a set of experiments to be performed with a set of resources the method comprising: generating by a computer a plurality of configurations based on a set of parameters and experimental constraints each configuration including a plurality of experimental points each experimental point having a set of values for the parameters one or more patterns representing the application of parameters to one or more lattice points of an experiment lattice under the experimental constraints wherein experimental constraints for a given pattern of the one or more patterns are represented by a set of attributes and wherein the generating comprises: generating a plurality of pattern instances of at least one of the one or more patterns each pattern instance defined by a set of attribute values for the set of attributes representing the experimental constraints for the at least one of the one or more patterns the set of attribute values specifying a quantity of a parameter to be applied at one or more lattice points of the experiment lattice; and  combining the plurality of pattern instances to generate a configuration such that the parameter values for an experimental point in the configuration generated by combining the plurality of pattern instances are based on the parameter values specified by the combined pattern instances for a corresponding lattice location;   selecting by the computer a configuration from the plurality of configurations;  defining by the computer a set of experiments based on the selected configuration; and  displaying by the computer a visual representation of the defined set of experiments  wherein the set of parameters includes a plurality of factors to be varied in the set of experiments and represents axes defining a parameter space and  wherein the experimental constraints represent limitations on operations that can be performed with the set of resources.                   2. 
The method of  claim 1  further comprising: outputting the design in a format suitable for implementation using an automated synthesis tool.                   3. The method of  claim 1  further comprising: preparing a library embodying the set of experiments based on the design.                   4. A computer program product on a computer-readable storage medium for designing a set of experiments to be performed with a set of resources the program comprising instructions operable to cause a programmable processor to: generate a plurality of configurations based on a set of parameters and a set of constraints each configuration including a plurality of experimental points each experimental point having a set of values for the parameters one or more patterns representing the application of parameters to one or more lattice points of an experiment lattice under the set of constraints wherein constraints for a given pattern of the one or more patterns are represented by a set of attributes;  generate a plurality of pattern instances of at least one of the one or more patterns each pattern instance defined by a set of attribute values for the set of attributes representing the set of constraints for the at least one of the one or more patterns the set of attribute values specifying a quantity of a parameter to be applied at one or more lattice points of the experiment lattice;  combine the plurality of pattern instances to generate a configuration such that the parameter values for an experimental point in the configuration generated by combining the plurality of pattern instances are based on the parameter values specified by the combined pattern instances for a corresponding lattice location;  select a configuration from the plurality of configurations;  define a set of experiments based on the selected configuration; and  display a visual representation of the defined set of experiments  wherein the set of parameters includes a plurality of factors to be varied in 
the set of experiments and represents axes defining a parameter space and  wherein the set of constraints includes one or more experimental constraints representing limitations on operations that can be performed with the set of resources.                   5. The computer program product of  claim 4  further comprising: instructions operable to cause a programmable processor to output the design in a format suitable for implementation using an automated synthesis tool.                   6. The computer program product of  claim 4  further comprising: instructions operable to cause a programmable processor to cause an automated synthesis device to prepare a library embodying the set of experiments based on the design.                   7. A computer-implemented system for designing a set of experiments to be performed with a set of resources the system comprising: a set of resources comprising one or more automated synthesis devices for carrying out a set of experiments;  a memory storing a set of parameters experimental constraints and one or more patterns the set of parameters including a plurality of factors to be varied in the set of experiments and representing axes defining a parameter space the experimental constraints representing limitations on operations that can be performed with the set of resources the one or more patterns representing the application of parameters to one or more lattice points of an experiment lattice under the experimental constraints wherein experimental constraints for a given pattern of the one or more patterns are represented by a set of attributes; and  a programmable processor configured to perform operations comprising: generating a plurality of configurations based on the set of parameters and the experimental constraints each configuration including a plurality of experimental points each experimental point having a set of values for the parameters the generating comprising: generating a plurality of pattern instances of at 
least one of the one or more patterns each pattern instance defined by a set of attribute values for the set of attributes representing the experimental constraints for the at least one of the one or more patterns the set of attribute values specifying a quantity of a parameter to be applied at one or more lattice points of the experiment lattice; and  combining the plurality of pattern instances to generate a configuration such that the parameter values for an experimental point in the configuration generated by combining the plurality of pattern instances are based on the parameter values specified by the combined pattern instances for a corresponding lattice location;   selecting a configuration from the plurality of configurations;  defining the set of experiments based on the selected configuration; and  outputting a design for the defined set of experiments in a format suitable for implementation using one or more of the automated synthesis devices.                    8. The system of  claim 7  wherein: the programmable processor is configured to display a visual representation of the defined set of experiments over a graphical user interface.                   9. The system of  claim 7  wherein: the automated synthesis devices include one or more devices for performing operations at one or more locations represented by lattice points of the experiment lattice; and  the one or more patterns include one or more device patterns having attributes representing constraints associated with the one or more devices.                   10. The system of  claim 9  wherein: the operations include process steps for applying parameters at the locations.                   11. The system of  claim 10  wherein: the process steps include depositing materials at one or more of the locations.                   12. The system of  claim 10  wherein: the process steps include subjecting materials at one or more of the locations to processing conditions.                   13. 
The system of  claim 9  wherein: the device pattern attributes for one or more device patterns include one or more device geometry attributes specifying a geometry in which a parameter will be applied to a substrate.                   14. The system of  claim 13  wherein: the device geometry attributes include a thickness attribute representing a quantity of the parameter to be applied.                   15. The system of  claim 9  wherein: the one or more devices comprise a mask; and  one or more of the device patterns represent openings in the mask for exposing locations on a substrate.                   16. The system of  claim 9  wherein: the one or more devices comprise a shutter mask system for exposing locations on a substrate; and  one or more of the device patterns represent openings in the shutter mask system.                   17. The system of  claim 9  wherein: the one or more devices comprise a set of dispensing tips for delivering materials to locations on a substrate; and  one or more of the device patterns represent the set of dispensing tips.                   18. The system of  claim 9  further comprising: a set of dispensing tips for delivering materials to locations on a substrate;  wherein the plurality of pattern instances includes a plurality of device pattern instances specifying amounts of one or more materials to be deposited at locations on the substrate.                   19. The system of  claim 7  wherein: the experimental constraints comprises one or more component patterns representing an arrangement of materials to be used in performing a set of experiments; and  generating the plurality of pattern instances includes superimposing the pattern instances with the component patterns such that the pattern instances represent the application of the arrangement of materials to lattice points of the experiment lattice.                   20. 
The system of  claim 19  wherein: the component patterns include a component pattern representing a library lattice for a parent library of materials to be used in performing a set of experiments.                   21. The system of  claim 19  wherein: the one or more component patterns include a first component pattern representing a first arrangement of materials that could be used in performing the set of experiments and a second arrangement of materials that could be used in performing the set of experiments;  generating the plurality of configurations includes generating a first configuration based on the first component pattern and a second configuration based on the second component pattern; and  selecting the configuration includes identifying an optimum component pattern from the first and second component patterns.                   22. The system of  claim 7  wherein: combining the plurality of pattern instances includes superimposing a plurality of pattern instances with one or more experiment lattices.                   23. The system of  claim 7  wherein: each configuration in the plurality of configurations represents a set of experiments that can be performed with the set of resources.                   24. The system of  claim 7  wherein: generating the plurality of configurations includes repeating the steps of generating the plurality of pattern instances and combining the plurality of pattern instances.                   25. The system of  claim 24  wherein: generating the plurality of configurations includes generating a plurality of sets of pattern instances by varying the number and/or attribute values of pattern instances.                   26. 
The system of  claim 7  wherein: generating the plurality of configurations includes generating a first configuration and subsequently generating a sequence of second configurations each of the second configurations being generated by adding a pattern instance to a preceding configuration in the sequence removing a pattern instance from a preceding configuration in the sequence or changing an attribute value for an attribute of a pattern instance in a preceding configuration in the sequence.                   27. The system of  claim 7  wherein: combining the plurality of pattern instances includes defining a sequence of pattern instances the points in the configuration being defined in part by order information derived from the sequence.                   28. The system of  claim 27  wherein: generating the plurality of configurations includes generating a first configuration and subsequently generating a sequence of second configurations each of the second configurations being generated by adding a pattern instance to a preceding configuration in the sequence removing a pattern instance from a preceding configuration in the sequence changing an attribute value for an attribute of a pattern instance in a preceding configuration in the sequence or changing the position of a pattern instance in the sequence.                   29. The system of  claim 27  wherein: selecting the configuration includes identifying an optimum sequence of events for the set of experiments.                   30. 
The system of  claim 7  wherein: the one or more patterns includes patterns representing alternate applications of parameters to lattice points of an experiment lattice the one or more patterns including a first pattern defined by a first set of attributes and a second pattern defined by a second set of attributes the second set of attributes differing from the first set of attributes in at least one attribute;  generating the plurality of configurations includes combining instances of the first pattern to generate a first configuration and combining instances of the second pattern to generate a second configuration; and  selecting the configuration includes identifying an optimum pattern from the first and second patterns.                   31. The system of  claim 7  wherein: the one or more experiment lattices include a first experiment lattice representing a first arrangement in which a set of experiments could be performed and a second experiment lattice representing a second arrangement in which the set of experiments could be performed;  generating the plurality of configurations includes superimposing pattern instances with the first experiment lattice to generate a first configuration and superimposing pattern instances with the second experiment lattice to generate a second configuration; and  selecting the configuration includes identifying an optimum experiment lattice from the first and second experiment lattices.'

Optimizing object types using categoricals

Pandas introduced Categoricals in version 0.15. The category type uses integer values under the hood to represent the values in a column, rather than the raw values. Pandas uses a separate mapping dictionary that maps the integer values to the raw ones. This arrangement is useful whenever a column contains a limited set of values. When we convert a column to the category dtype, pandas uses the most space-efficient integer subtype that can represent all of the unique values in the column.
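As a rough sketch of what this buys us (toy data, not the notebook's dataframe), the `cat` accessor exposes both the integer codes and the mapping back to the raw values:

```python
import pandas as pd

# Toy example (not the notebook's data): a repetitive string column
# stored as object vs. category.
s_obj = pd.Series(['705', '706', '705', '706', '705'] * 1000)
s_cat = s_obj.astype('category')

# The category dtype stores each unique value once...
print(s_cat.cat.categories.tolist())    # ['705', '706']
# ...and a small integer code per row (int8 here, since there are
# far fewer than 128 categories).
print(s_cat.cat.codes.head().tolist())  # [0, 1, 0, 1, 0]

# Deep memory usage drops sharply for low-cardinality columns.
print(s_obj.memory_usage(deep=True) > s_cat.memory_usage(deep=True))  # True
```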

In [9]:
df['code'] = df['code'].astype('category')
In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5836 entries, 0 to 5835
Data columns (total 3 columns):
descr    5836 non-null object
clm      5836 non-null object
code     5836 non-null category
dtypes: category(1), object(2)
memory usage: 97.1+ KB
In [11]:
df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5836 entries, 0 to 5835
Data columns (total 3 columns):
descr    5836 non-null object
clm      5836 non-null object
code     5836 non-null category
dtypes: category(1), object(2)
memory usage: 879.4 MB

Lime visualization example with a 2-class classifier (using only the 'descr' column)

In [17]:
from sklearn.model_selection import train_test_split
X = df['descr']
y = df['code']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)

Let's use the TF-IDF vectorizer, commonly used for text.

Several important parameters are passed to the constructor of the class. The first is max_features, which is set to 1500. When you convert words to numbers using the bag-of-words approach, every unique word across all the documents becomes a feature, and the documents can contain tens of thousands of unique words. Words that occur with very low frequency are usually not good features for classifying documents, so we set max_features to 1500, meaning we keep only the 1500 most frequently occurring words as features for training our classifier.

The next parameter is min_df, which is set to 5. This is the minimum number of documents that must contain a feature, so we only include words that occur in at least 5 documents. Similarly, max_df is set to 0.7, a fraction interpreted as a proportion of the corpus: we include only words that occur in at most 70% of all the documents. Words that occur in almost every document are usually not suitable for classification because they provide no information that distinguishes one document from another.

Finally, we remove the stop words from our text, since stop words rarely carry information useful for classification. To remove them we pass the English stop-word list from the nltk.corpus library to the stop_words parameter.

The fit_transform method of the TfidfVectorizer class converts the text documents into the corresponding numeric features.
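To make the effect of these parameters concrete, here is a minimal sketch on a toy corpus (the thresholds are shrunk to fit 4 documents; the notebook uses max_features=1500, min_df=5, max_df=0.7 on the real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four tiny documents; single-letter tokens like "a" are dropped by
# the default token pattern.
docs = [
    "the method comprises a computer",
    "the system comprises a processor",
    "the computer performs the method",
    "a processor executes the method",
]

# min_df=2 drops words seen in fewer than 2 documents ("system",
# "performs", "executes"); max_df=0.75 drops words seen in more than
# 75% of documents ("the", which appears in all 4).
vec = TfidfVectorizer(max_features=10, min_df=2, max_df=0.75)
X = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # ['comprises', 'computer', 'method', 'processor']
print(X.shape)                  # (4, 4)
```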

Stop word removal

In [22]:
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False, max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)

Now, let's say we want to use random forests for classification. It's usually hard to understand what random forests are doing, especially with many trees.

In [23]:
rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)
rf.fit(train_vectors, y_train)
Out[23]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [24]:
from sklearn.metrics import roc_curve, auc, f1_score, recall_score, precision_score
pred = rf.predict(test_vectors)
sklearn.metrics.f1_score(y_test, pred, average=None)
Out[24]:
array([ 0.93092551,  0.90653635])
In [25]:
pred2 = rf.predict(test_vectors)
sklearn.metrics.f1_score(y_test, pred2, average='weighted')
Out[25]:
0.92017452263333466

Explaining predictions using lime

Lime explainers assume that classifiers act on raw text, but sklearn classifiers act on a vectorized representation of the text. To bridge the two, we use sklearn's pipeline to combine the vectorizer and the classifier into a single object that implements predict_proba on lists of raw text.

In [26]:
from lime import lime_text
from sklearn.pipeline import make_pipeline
c = make_pipeline(vectorizer, rf)
In [27]:
print(c.predict_proba([X_test.iloc[0]]))
[[ 0.832  0.168]]

Now we create an explainer object. We pass class_names as an argument for prettier display.

In [28]:
from lime.lime_text import LimeTextExplainer
In [29]:
class_names = ['705','706']
explainer = LimeTextExplainer(class_names=class_names)

We then generate an explanation with at most 6 features for an arbitrary document in the test set.

In [30]:
idx = 83
exp = explainer.explain_instance(X_test.iloc[idx], c.predict_proba, num_features=6)
print('Document id: %d' % idx)
print('Probability(706) =', c.predict_proba([X_test.iloc[idx]])[0,1])
print('True class: %s' % y_test.iloc[idx])
/opt/conda/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)
Document id: 83
Probability(706) = 0.604
True class: 706
In [31]:
print(y_test.iloc[idx])
706
In [32]:
print(X_test.iloc[idx])
BACKGROUND      This invention relates to the field of process automation. In particular the invention relates to adaptive business process automation.      Business processes are often defined as a flow sequence of operations on a single or set of systems or applications. This flow is composed of interactions with screens of the system in which the operator needs to verify fields of information for a valid content take note of certain fields for later processing enter new information or update existing fields and navigate between screens.      Some business processes involve systems or applications that were not designed with optimization of operator work in mind. Some business processes involve a combination of systems which were not designed to interact and work together. Therefore these processes are often cumbersome involve many operations on possibly different screens in possibly different systems and hence error prone.      The result of these issues is that often an operator needs to invest a considerable amount of time and effort in order to complete the business process. This time and effort has a substantial cost for the organization which can be significantly reduced.      There are existing systems that assist operators and users in these processes. For example there are tools that help users fill online forms using pre-populated information of the user (e.g. name address phone number) as well as more elaborate systems that can automatically extract such information from the user&#39;s actions (e.g. password managers) and resources such as email correspondence.      
BRIEF SUMMARY      According to a first aspect of the present invention there is provided a method for process automation comprising: monitoring one or more workstations including monitoring screen contents and user actions at the workstation; analysing the screen contents and user actions into monitored functional events; providing multiple focal states as defined sequences of functional events with one or more facilitating scripts associated with a focal state wherein a facilitating script provides one or more automatic actions; matching a sequence of monitored functional events to a defined sequence of functional events of a focal state; and applying the one or more automatic actions of a facilitating script associated with the matched focal state; wherein said steps are implemented in either: computer hardware configured to perform said identifying tracing and providing steps or computer software embodied in a non-transitory tangible computer-readable storage medium.      According to a second aspect of the present invention there is provided a computer program product for process automation the computer program product comprising: a computer readable storage medium having computer readable program code embodied therewith the computer readable program code comprising: computer readable program code configured to: monitoring one or more workstations including monitoring screen contents and user actions at the workstation; analysing the screen contents and user actions into monitored functional events; providing multiple focal states as defined sequences of functional events with one or more facilitating scripts associated with a focal state wherein a facilitating script provides one or more automatic actions; matching a sequence of monitored functional events to a defined sequence of functional events of a focal state; and applying the one or more automatic actions of a facilitating script associated with the matched focal state.      
According to a third aspect of the present invention there is provided a system for process automation comprising: a processor; a monitoring agent for monitoring one or more workstations including monitoring screen contents and user actions at the workstation; a current set module for analysing the screen contents and user actions into monitored functional events; a focal state provider for providing multiple focal states as defined sequences of functional events and a facilitating script provider providing one or more facilitating scripts associated with a focal state wherein a facilitating script provides one or more automatic actions; a matching module for matching a sequence of monitored functional events to a defined sequence of functional events of a focal state; and an applying module for applying the one or more automatic actions of a facilitating script associated with the matched focal state.                     BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS        The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention both as to organization and method of operation together with objects features and advantages thereof may best be understood by reference to the following detailed description when read with the accompanying drawings in which:          FIG. 1  is a flow diagram of a method in accordance with an aspect of the present invention;          FIG. 2  is a flow diagram of a method in accordance with an aspect of the present invention;          FIG. 3  is a flow diagram of a method in accordance with an aspect of the present invention;          FIG. 4  is a schematic diagram of a process in accordance with the present invention;          FIG. 5  is a block diagram of a system in accordance with the present invention; and          FIG. 6  is a block diagram of a computer system in which the present invention may be implemented.               
    It will be appreciated that for simplicity and clarity of illustration elements shown in the figures have not necessarily been drawn to scale. For example the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further where considered appropriate reference numbers may be repeated among the figures to indicate corresponding or analogous features.      DETAILED DESCRIPTION      In the following detailed description numerous specific details are set forth in order to provide a thorough understanding of the invention. However it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances well-known methods procedures and components have not been described in detail so as not to obscure the present invention.      The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein the singular forms “a” “an” and “the” are intended to include the plural forms as well unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising” when used in this specification specify the presence of stated features integers steps operations elements and/or components but do not preclude the presence or addition of one or more other features integers steps operations elements components and/or groups thereof.      The corresponding structures materials acts and equivalents of all means or step plus function elements in the claims below are intended to include any structure material or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description but is not intended to be exhaustive or limited to the invention in the form disclosed. 
Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.      A method system and computer program product are described in which operator actions are monitored and the information used in order to have automatic or semi-automatic identification of interesting states in transaction processes. Special scripts are activated in order to facilitate fast and correct transaction conclusion. The impact of such special scripts is monitored so that in future transactions most effective scripts would be applied.      The manual work of the operator is monitored by a monitoring system and information about screens contents and actions the operator performs on these screens is stored. The monitoring is of actions and events from the view of the human operator. The monitoring may include: the entered and retrieved data; the screen contents; the movements of the user around the screen (for example mouse movements); switching between viewed windows files applications or systems; and logs of the timing of each operation. This stored set dynamically updated is referred to as the “current set”.      Information in the current set may be kept on a functional level. For example operator A viewed customer B&#39;s address for 5 seconds operator A viewed last year&#39;s service request for a further 10 seconds and it took operator A 20 seconds to reach the approval decision.      In addition the described system calls for identification of a set of “focal states” in the transaction processing. A focal state may also be defined on a functional level including a sequence of functional events. 
Such focal states can be predefined by the system administrator (for example a state where an operator reviewed required information and was unable to reach resolution for at least 10 seconds). Alternatively states can be identified by automatic analysis of historic patterns.      Each focal state has an associated single or set of “facilitating scripts”. Again such scripts can be either predefined by the system administrator or prepared automatically based on the analysis of historic logs. Facilitating scripts provide one or more automatic actions such as keyboard or mouse actions or function calls.      A user&#39;s workstation system will start its operation in the conventional manner. However in parallel to the conventional operation the current set would be constantly analyzed in order to identify focal states of interest (either by exact match or fuzzy match where only some of the functional events in the current set match the functional events in the focal state for example using Regular Expression type of fuzzy matching).      Once a focal state is identified the automation system will step in with or without the operator notification and approval and the related facilitating script will be executed. When the system concludes the control is transferred again to the operator. Alternatively the operator will be shown a pop up window with set of possible facilitating scripts to be chosen from manually.      The aim of the facilitating scripts is to facilitate transaction completion (for example by automating fetching of the necessary information or automatic report preparation). This would manifest itself by improvement of certain performance measures (for example elapsed time to the transaction completion or probability of the return to the given transaction due to the customer appeal). The system would monitor such performance measurements and prioritize the scripts accordingly.      
In one embodiment “focal states” in the form of sequences of functional events to be identified in the operation of a user&#39;s workstation may be defined by an administrator. One or more “facilitating scripts” associated with a focal state may also be defined by an administrator. A facilitating script is a script of actions to be automatically carried out by a system to aid the human operator starting when the system reaches an identified focal state to which the facilitating action is associated.      In an alternative embodiment the focal states and facilitating scripts may be generated automatically. Referring to  FIG. 1  a flow diagram  100  shows one embodiment of a method of generating automation data sets. A workstation is monitored  101  and information relating to the screen contents and user actions is logged  102  as logged data  111 .      The logged data  111  is analysed by evaluating and abstracting to generate  103  “focal states”  112  in the form of sequences of functional events. For example this may be done by looking for repeated sequences of similar events or repeated data patterns.      One or more facilitating scripts  113  are identified for a focal state  112 . A facilitating script  113  is a script of one or more actions to be automatically carried out by a system. For example facilitating scripts  113  may be automatically generated by analysing and evaluating  104  the logged data for actions following a focal state of functional events. The identified actions can then be translated  105  into a facilitating script for the focal event in the form of a script to be automatically applied. The facilitating script is then stored and associated  106  with the focal state.      Each facilitating script is associated to a focal state in a way that when activated in a future scenario from that focal state it would “bring” the system to a “next” desired state.      Referring to  FIG. 
2  a flow diagram  200  shows the method of using the stored data sets for process automation. A workstation is monitored  201  and information relating to the screen contents and user actions is logged  202  and analysed to abstract  203  to functional events. The functional events are stored  204  as the “current set” of user operations.      The current set is monitored  205  and a focal state identified  206  by matching a sequence of functional events of the current set to the sequence of functional events in a focal state.      Facilitating scripts associated with the identified focal state are fetched  207 . It is determined  208  if there are more than one facilitating scripts associated with the identified focal state. If so the options of multiple facilitating scripts are provided  209  to the user.      The user selects  210  a facilitating script or optionally has the choice to ignore the facilitating scripts. If the facilitating scripts are ignored the method continues to monitor  201  the workstation.      If there is only one facilitating script optionally the user is notified of the facilitating script with an option  211  to allow the automation or to ignore it. If the facilitating script option is ignored the method continues to monitor  201  the workstation.      If a facilitating script is accepted it is applied  212  to the user activity to automatically carry out the actions of the facilitating script. The method continues to monitor  201  the workstation.      Referring to  FIG. 3  a flow diagram  300  shows an additional optional aspect in which the impact or cost of the actions of a facilities script is monitored and the facilitating script is dynamically adapted to optimize the effect of the actions.      A focal state is identified  301  and a facilitating script is applied  302  by the same process as described in  FIG. 2 . A facilitating script is applied  302  with an initial implementation state. 
A cost function evaluation  303  is carried out to measure one or more parameters of the system to determine the effectiveness of the facilitating script. An adaptive algorithm  304  is applied which determines if the result of the cost function evaluation  303  should be encouraged or penalised. A perturbation mechanism adapts  305  the facilitating script to provide an adapted facilitating script. The adapted facilitating script is applied  302  and the process repeated. After many iterations of adapting the facilitating script the system will converge to an optimized solution.      An example of the facilitating script adaptation aspect is described. The focal state may be the detection of the arrival of a transaction of type A (this may include different descriptors such as request type source (customer group) date etc.) which should then be assigned by the system to an operator from a pool of operators. A facilitating script may be a routing script that routes that transaction to an operator. The routing script can be implemented by having random assignment of transactions to each operator with the possibility of biasing (so that for example operator O 1  will have 90% of chance of getting transaction of type A).      An initial implementation of the facilitating script may be that of uniform distribution of transactions to each operator (i.e. no biasing or all the biases equal to each other). The adaptive perturbation mechanism will change the facilitating script (for example by sending a given type of transaction to randomly chosen operator O 1 ).      For each perturbation the system will measure a predetermined cost function (for example the execution time with predetermined penalty for errors or an instantaneous cost for the given transaction minus an average cost for all operators instantaneous cost can be estimated as a transaction execution time increased by say 10 times error rate).      Perturbations that reduce cost function would be encouraged. 
Perturbations that increase the cost function would be penalized. For instance if routing to O 1  reduced the cost function the system would be biased to send more transactions of this type to O 1 . For each transaction the change in bias may be small. However after many transactions the system will converge. In such a manner the system would move towards the optimal solution.      Referring to  FIG. 4  a schematic diagram  400  shows the process of monitoring a user interface of a user system or workstation to provide automation.      A user system or workstation  401  includes screen contents  402  and user inputs or actions  403 . At a first step a current set  410  is dynamically recorded of functional events  411 - 414 . The current set  410  is obtained from the user system  401  by analysis and abstraction  461  of the contents  402  and actions  403  of the user system  401  to define functional events  411 - 414 .      A set of focal states  421 - 423  is provided with each focal state  421 - 423  defining a sequence of functional events  431 - 433 . One or more facilitating scripts  441 - 442  are provided for each focal state  421 - 423 . Facilitating scripts  441 - 442  provide one or more actions  451 - 452  to be automatically carried out when a focal state  421 - 423  is identified in the user system  401 .      At a second step the focal states  421 - 423  are compared and matched  462  to the functional events  411 - 414  in the current set  410  by exact or fuzzy matching of the functional events  411 - 414  of the current set  410  to functional events  431 - 433  of focal states  421 - 423 .      If a match is found in a third step a facilitating script  441 - 442  associated with the matched focal state  421 - 423  is applied  463  to the user system  401 . A user may have the option to choose if a facilitating script is applied or which one of multiple facilitating scripts is applied.      
The adaptive facilitating script aspect would be applied in the apply step  463  of  FIG. 4  with iterations of adaptations of the facilitating script applied with cost analysis of each adaptation.      Referring to  FIG. 5  a block diagram shows a system  500  in which a user system or workstation  510  includes a display  511  such as a screen or other mechanism for showing contents and a user input mechanism  512  controlling inputs or actions such as keyboard inputs pointer device inputs touch screen inputs pointer device movements etc.      The user system  510  is monitored by a monitoring agent  520  to gather information regarding events actions and data. The monitoring may include all (or parts) of the information known either to the user or to the system. User information can be determined for instance by screen scraping and extracting all the information that has been seen by the user. System information can be determined for example by logging performance history for all the users.      The monitoring agent  520  includes a screen contents monitor in the form of a screen scraping module  521  for obtaining character information and graphical information. The screen scraping module  521  includes an optical character recognition (OCR) module  522  for obtaining information and data regarding the contents of the display  511  of the user system  510  as viewed by a human operator. This includes the content of different windows within a display. The screen scraping module  521  also includes a graphical module  524  for obtaining graphical information such as icons in the system and other graphical elements (such as lines boxes etc.) and analysis of the hierarchical structure of the screen.      
Screen scraping techniques include capturing the bitmap data of the screen and running it through an OCR engine or in the case of GUI applications querying the graphical controls by programmatically obtaining references to their underlying programming objects both OCR and graphical information objects. A web scraper for obtaining web content may also be included.      The monitoring agent  520  also includes a user action or input monitoring module  523  for monitoring the human operator&#39;s input including both keyboard input and other input device operations and movement including information regarding the navigation of the user around the display  511 .      A log  541  in a storage medium  540  records the monitored information and data including times and durations of actions and inputs.      A process automation mechanism  530  is provided for applying automation to processes carried out on the user system  510 .      The automation mechanism  530  includes a focal state provider  531  for providing focal states  543  in the form of sequences of functional events. The focal state provider  531  may allow an administrator to define the focal states  543  or alternatively may include an automatic focal state generator for analysing the data log  541  and generating focal states.      The automation mechanism  530  also includes a facilitating script provider  532  for providing facilitating scripts  545  in the form of sequences of actions associated with a focal state. The facilitating script provider  532  may allow an administrator to define the actions in the script or alternatively may include an automatic facilitating script generator for analysing the data log  541  and determining actions required in a facilitating script for a focal state.      The automation mechanism  530  includes a dynamically updated current set module  533 . 
The current set module  533  receives update data on the user system activity from the monitoring agent  520  and abstracts the data to provide a current set  544  of functional events stored in a storage medium  540 .      In an example the current set module  533  uses screen understanding and every screen viewed by the user is analyzed and information content extracted to provide the current set. A typical record would say for example that:         User X viewed screen Y containing fields of Customer Name Customer Address Number of the past enquiries etc.   Optionally the system may also store information contained in each field.   The screen was viewed for n seconds.   Optionally start and end time stamps may be kept.   Optionally it may be useful to correlate all the keystrokes and mouse movements and actions pertaining to the given screen.            The automation mechanism  530  also includes a matching module  534  for comparing and matching stored focal states  543  to functional events in the current set  544 . An applying module  535  applies a facilitating script  544  to the user system  510  for a matched focal state. The automation mechanism  530  may optionally include a user interface  536  to confirm or select a facilitating script  545  before it is applied.      An adaptive mechanism  550  may optionally be included in the automation mechanism  530  to provide adaptation of facilitating scripts  545  to optimize the applied actions. The adaptive mechanism  550  includes a cost function evaluation module  551  for evaluating the cost to the system of applying a facilitating script. The adaptive mechanism  550  also includes an adaptive algorithm module  552  for determining if the result of the cost function evaluation should be encouraged or penalised. A perturbation mechanism  553  changes the facilitating script  545  in accordance with the cost to the system.      
The applying module  535  applies adapted facilitating scripts on an iterative basis until an optimized facilitating script for a system is obtained.      Referring to  FIG. 6  an exemplary system for implementing aspects of the invention includes a data processing system  600  suitable for storing and/or executing program code including at least one processor  601  coupled directly or indirectly to memory elements through a bus system  603 . The memory elements can include local memory employed during actual execution of the program code bulk storage and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.      The memory elements may include system memory  602  in the form of read only memory (ROM)  604  and random access memory (RAM)  605 . A basic input/output system (BIOS)  606  may be stored in ROM  604 . System software  607  may be stored in RAM  605  including operating system software  608 . Software applications  610  may also be stored in RAM  605 .      The system  600  may also include a primary storage means  611  such as a magnetic hard disk drive and secondary storage means  612  such as a magnetic disc drive and an optical disc drive. The drives and their associated computer-readable media provide non-volatile storage of computer-executable instructions data structures program modules and other data for the system  600 . Software applications may be stored on the primary and secondary storage means  611   612  as well as the system memory  602 .      The computing system  600  may operate in a networked environment using logical connections to one or more remote computers via a network adapter  616 .      Input/output devices  613  can be coupled to the system either directly or through intervening I/O controllers. 
A user may enter commands and information into the system  600  through input devices such as a keyboard pointing device or other input devices (for example microphone joy stick game pad satellite dish scanner or the like). Output devices may include speakers printers etc. A display device  614  is also connected to system bus  603  via an interface such as video adapter  615 .      An example of the described process automation can be the selection of an email signature based on the destination. It is supposed that a user often wants to change his signature when sending an email to person X. A facilitating script is written and provided that replaces the signature in the email with the required signature. A set of screens/data/actions are defined as a focal set related to the facilitating script so that whenever the screen of composing a new email appears and a certain name of recipient of person X is written into the ‘to’ field the system will automatically run the related facilitating script and the signature will be replaced.      Customization parameters can be defined for each such facilitating script. Such parameters may include for example whether to run it automatically or after an approval of the user to a pop-up question and specific times (for example only weekends or only evenings) that a facilitating script is active etc.      As an example of the automated learning of relations suppose the operator works on a business process involving a web application and a mainframe system. The operator starts with the web application in which he gets a customer&#39;s details and then needs to go into the mainframe system and perform some action for the customer (in this example change the address of the customer.). A way to improve this manual business process is to have the system automatically learn the relations between the current set and the active screen.      
When a relation appears between information seen in one screen and information that is entered in a later screen this relation is stored in a long term memory. After a few iterations going through the same screens if the relation is repeated a predetermined number of times the software will use this relation to suggest to the operator to automatically fill in the fields associated with this relation. The number of times an action is repeated to constitute a reason to suggest that action in the future may be customized by a user. Also it may be defined that there is no contradicting behaviour which should be considered.      In another example a business process requires the operator to key in the ID number from a web application into the mainframe screen. This retrieves the customer&#39;s details and allows the operator to key in an updated address. The monitoring agent monitors this process and identifies that the value keyed in to the ID number field in the mainframe actually appeared in the ‘ID Number’ field in the other application just a moment ago. If this pattern is observed to be repeated several times the monitoring agent can suggest to the operator the next time this occurs that this information can be automatically filled in.      Another example would be the case where intranet password verification is required. This can be a time consuming process where both intranet userid and password are to be filled taking up about 20 seconds of the user time. In principle this problem can be resolved by preparing appropriate facilitating script that would identify the verification screen and fill in the necessary information. However this would mean that the operator would have to know and install such a feature. Using an adaptive approach this would not be necessary. The system would identify automatically the correlation between password verification request and appropriate information being filled. 
After several cases user verification would be done automatically. Now assume that for some reason the password request screen is redesigned so that data verification would be filled in a different manner. In such a case the operator would have to correct the data that has been filled in automatically. Efficiency of the process would go down. As a result the system would look for (and find) new script to be used under the new circumstances.      An active option may be provided where a system would introduce minute changes and measure the results of such changes in order to determine an optimal approach.      For example a large number of operators may perform various types of transactions. It is assumed for the sake of simplicity that each operator can perform each task. Commonly in such cases all the jobs would be distributed evenly between all the operators. However it may be that one operator is faster doing transactions of type A while another is better doing transactions of type B. In the adaptive system system control would make small changes in the routing policy. For instance it may try routing more jobs of type B to the first operator and more jobs of type A to the second. As a result there would be some small penalty on the system performance and as a result the system would (correctly) reverse the policy ending up with each operator doing transactions where he/she has the best performance. In fact in order to facilitate measurement process some transactions may be routed to several operators.      The described automation may be used for performance measurement or optimization of complex business processes having multiple operators potentially performing complementary tasks. Optimization is done centrally analyzing a complex multi-operator system. 
For example optimization may target improvement in the transaction quality which may not be visible on at a single workstation level; hence the optimization focuses on the impact of the given process on the final quality as measured at a later stage in the transaction processing.      For example the system may determine that for the given focal point facilitating script used by the operator O 1  yields lower error than facilitating scripts used by operator O 2  and O 3 . Hence facilitating scripts for operators O 2  and O 3  will be amended. Alternatively the system may determine that even for given facilitating script different operators feature different error rate. Hence for instance transactions performed by operator O 1  can be accepted as is. For operators O 2  and O 3  the same transaction must be routed to both of them with differences being resolved by the operator O 4 .      In the above description the focus is on the business process applications. However a similar approach can be applied to personal computing. In other words an adaptive process automation mechanism may work on a personal workstation.      An adaptive business process automation may be provided as a service to a customer over a network.      As will be appreciated by one skilled in the art aspects of the present invention may be embodied as a system method or computer program product. Accordingly aspects of the present invention may take the form of an entirely hardware embodiment an entirely software embodiment (including firmware resident software micro-code etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit” “module” or “system.” Furthermore aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.      Any combination of one or more computer readable medium(s) may be utilized. 
The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be for example but not limited to an electronic magnetic optical electromagnetic infrared or semiconductor system apparatus or device or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires a portable computer diskette a hard disk a random access memory (RAM) a read-only memory (ROM) an erasable programmable read-only memory (EPROM or Flash memory) an optical fiber a portable compact disc read-only memory (CD-ROM) an optical storage device a magnetic storage device or any suitable combination of the foregoing. In the context of this document a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system apparatus or device.      A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein for example in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms including but not limited to electro-magnetic optical or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate propagate or transport a program for use by or in connection with an instruction execution system apparatus or device.      Program code embodied on a computer readable medium may be transmitted using any appropriate medium including but not limited to wireless wireline optical fiber cable RF etc. or any suitable combination of the foregoing.      
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages including an object oriented programming language such as Java Smalltalk C++ or the like and conventional procedural programming languages such as the “C” programming language or similar programming languages. The program code may execute entirely on the user&#39;s computer partly on the user&#39;s computer as a stand-alone software package partly on the user&#39;s computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario the remote computer may be connected to the user&#39;s computer through any type of network including a local area network (LAN) or a wide area network (WAN) or the connection may be made to an external computer (for example through the Internet using an Internet Service Provider).      Aspects of the present invention are described above with reference to flowchart illustrations and/or block diagrams of methods apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams and combinations of blocks in the flowchart illustrations and/or block diagrams can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer special purpose computer or other programmable data processing apparatus to produce a machine such that the instructions which execute via the processor of the computer or other programmable data processing apparatus create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.      
These computer program instructions may also be stored in a computer readable medium that can direct a computer other programmable data processing apparatus or other devices to function in a particular manner such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.      The computer program instructions may also be loaded onto a computer other programmable data processing apparatus or other devices to cause a series of operational steps to be performed on the computer other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.      The flowchart and block diagrams in the Figures illustrate the architecture functionality and operation of possible implementations of systems methods and computer program products according to various embodiments of the present invention. In this regard each block in the flowchart or block diagrams may represent a module segment or portion of code which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations the functions noted in the block may occur out of the order noted in the figures. For example two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order depending upon the functionality involved. 
It will also be noted that each block of the block diagrams and/or flowchart illustration and combinations of blocks in the block diagrams and/or flowchart illustration can be implemented by special purpose hardware-based systems that perform the specified functions or acts or combinations of special purpose hardware and computer instructions.
In [33]:
exp.as_list()
Out[33]:
[('learning', 0.027807149071671613),
 ('output', 0.020810999673321742),
 ('complex', 0.020249491431132648),
 ('transaction', -0.020112884040263523),
 ('state', 0.017209915240610666),
 ('phone', -0.0152211199233059)]

From the lime docs (the quoted example there uses the 20 newsgroups data, hence features like 'Posting' and 'Host'): these weighted features are a linear model, which approximates the behaviour of the random forest classifier in the vicinity of the test example. Roughly, if we remove a feature from the document, the prediction should move towards the opposite class by about that feature's weight. In our case, removing 'transaction' and 'state' should shift the prediction by roughly the sum of their weights. Let's see if this is the case.
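To make that concrete, a small sketch using the weights printed by exp.as_list() above (the dict simply copies that output):

```python
# weights copied from the exp.as_list() output above
weights = {
    'learning': 0.027807149071671613,
    'output': 0.020810999673321742,
    'complex': 0.020249491431132648,
    'transaction': -0.020112884040263523,
    'state': 0.017209915240610666,
    'phone': -0.0152211199233059,
}

# removing a feature should move the prediction by roughly minus its weight,
# so zeroing out 'transaction' and 'state' should shift predict_proba by about:
expected_shift = -(weights['transaction'] + weights['state'])
print(round(expected_shift, 4))  # -> 0.0029, a tiny shift -- the weights nearly cancel
```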

Let's try it:

In [34]:
print('Original prediction:', rf.predict_proba(test_vectors[idx])[0,1])
tmp = test_vectors[idx].copy()
# note: lime's explanation listed 'transaction' (singular), but the tfidf
# vocabulary here contains 'transactions', so that is the feature we zero out
tmp[0,vectorizer.vocabulary_['transactions']] = 0
tmp[0,vectorizer.vocabulary_['state']] = 0
print('Prediction removing some features:', rf.predict_proba(tmp)[0,1])
print('Difference:', rf.predict_proba(tmp)[0,1] - rf.predict_proba(test_vectors[idx])[0,1])
Original prediction: 0.604
Prediction removing some features: 0.61
Difference: 0.006

Those two features were probably not worth much on their own. I need to try stop word removal or work with n-grams.
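A sketch of the n-gram idea on a hypothetical toy corpus: letting the vectorizer emit unigrams and bigrams makes multi-word phrases into features in their own right.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the facilitating script is applied",
    "a facilitating script is provided",
    "the operator ignores the script",
]
ngram_vec = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X_ngrams = ngram_vec.fit_transform(docs)

# the bigram 'facilitating script' is now a feature alongside the unigrams
print('facilitating script' in ngram_vec.vocabulary_)  # -> True
```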

Visualizing explanations

The explanations can be returned as a matplotlib barplot:

In [35]:
fig = exp.as_pyplot_figure()

The explanations can also be exported as an html page (which we can render here in this notebook), using D3.js to render graphs.

In [36]:
exp.show_in_notebook(text=False)

Alternatively, we can save the fully contained html page to a file:

In [37]:
exp.save_to_file('/tmp/oi_stopwords_removed.html')

Finally, we can also include a visualization of the original document, with the words in the explanations highlighted. Notice where in the text the most influential words appear.

In [38]:
exp.show_in_notebook(text=True)

This explainer works for any classifier you may want to use, as long as it implements predict_proba.
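For a model that exposes only decision_function (an SVM without probability estimates, say), a thin adapter can supply the predict_proba interface LIME needs. A minimal sketch with entirely hypothetical classes (`MarginClassifier`, `ProbaWrapper`); a calibrated model would be the principled choice, the sigmoid here is just for illustration:

```python
import math

class MarginClassifier:
    """Toy stand-in for a classifier that only exposes decision_function."""
    def decision_function(self, texts):
        # hypothetical scoring rule: positive margin if 'state' appears
        return [1.0 if 'state' in t else -1.0 for t in texts]

class ProbaWrapper:
    """Adds the predict_proba interface LIME expects, via a logistic squash."""
    def __init__(self, clf):
        self.clf = clf
    def predict_proba(self, texts):
        probs = []
        for margin in self.clf.decision_function(texts):
            p1 = 1.0 / (1.0 + math.exp(-margin))   # P(class 1)
            probs.append([1.0 - p1, p1])           # rows sum to 1
        return probs

wrapped = ProbaWrapper(MarginClassifier())
print(wrapped.predict_proba(['a focal state', 'a phone call']))
```

The wrapped object could then be passed to explainer.explain_instance in place of the pipeline's predict_proba.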

Same as the first RandomForestClassifier, but using the claims instead of the description

LIME visualization example with a 2-class classifier (using only the 'clm' column)

In [12]:
from sklearn.model_selection import train_test_split
X = df['clm']
y = df['code']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)

Let's use the TF-IDF vectorizer, commonly used for text.

In [13]:
from sklearn import feature_extraction
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False, max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)
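For intuition about what the vectorizer computes, here is a toy TF-IDF in pure Python. It is a sketch only: it uses raw term counts times sklearn's default smoothed idf, and ignores the min_df/max_df pruning, stop-word removal, and L2 normalization that the real TfidfVectorizer applies.

```python
import math
from collections import Counter

def toy_tfidf(corpus):
    """Toy TF-IDF: raw counts x smoothed idf (no pruning, no normalization)."""
    docs = [doc.split() for doc in corpus]
    n = len(docs)
    # document frequency: in how many docs each word appears
    df = Counter(w for d in docs for w in set(d))
    # smoothed idf, as in sklearn's default: ln((1+n)/(1+df)) + 1
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in df}
    return [{w: c * idf[w] for w, c in Counter(d).items()} for d in docs]

corpus = ['focal state focal', 'phone state', 'phone call']
vecs = toy_tfidf(corpus)
print(vecs[0])
```

Note how 'focal' (rare and repeated) scores higher in the first document than 'state' (shared with another document), which is the behaviour TF-IDF is designed to produce.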

Now, let's say we want to use random forests for classification. It's usually hard to understand what random forests are doing, especially with many trees.

In [14]:
rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)
rf.fit(train_vectors, y_train)  #rf.fit(train_vectors, newsgroups_train.target)
Out[14]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [15]:
from sklearn.metrics import roc_curve, auc, f1_score, recall_score, precision_score
pred = rf.predict(test_vectors)
sklearn.metrics.f1_score(y_test, pred, average=None)
Out[15]:
array([ 0.92616226,  0.90229192])
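With average=None, f1_score returns one score per class rather than an aggregate, so the array above is (F1 for 705, F1 for 706). A from-scratch sketch of that computation, on hypothetical toy labels:

```python
def f1_per_class(y_true, y_pred, classes):
    """Per-class F1, mirroring f1_score(..., average=None):
    each class is treated as the positive label in turn."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return scores

# toy labels for illustration only
y_true = ['705', '705', '706', '706', '706']
y_pred = ['705', '706', '706', '706', '705']
print(f1_per_class(y_true, y_pred, ['705', '706']))
```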

Explaining predictions using lime

LIME explainers assume that classifiers act on raw text, but sklearn classifiers act on vectorized representations of text. To bridge the two, we use sklearn's pipeline, which implements predict_proba on lists of raw text.

In [16]:
from lime import lime_text
from sklearn.pipeline import make_pipeline
c = make_pipeline(vectorizer, rf)
In [17]:
print(c.predict_proba([X_test.iloc[0]]))
[[ 0.784  0.216]]
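Conceptually, make_pipeline(vectorizer, rf) just chains the steps: every step but the last transforms the input, and the final estimator scores the result. A minimal sketch with hypothetical toy steps (`LengthVectorizer`, `ThresholdModel`) standing in for the TF-IDF vectorizer and the forest:

```python
class TinyPipeline:
    """Sketch of what make_pipeline gives us: transform with every step
    but the last, then call the final estimator's predict_proba."""
    def __init__(self, *steps):
        self.steps = steps
    def predict_proba(self, raw_texts):
        data = raw_texts
        for step in self.steps[:-1]:
            data = step.transform(data)
        return self.steps[-1].predict_proba(data)

class LengthVectorizer:
    # hypothetical toy transformer: a single feature, the token count
    def transform(self, texts):
        return [[len(t.split())] for t in texts]

class ThresholdModel:
    # hypothetical toy model: documents longer than 3 tokens get class 1
    def predict_proba(self, X):
        return [[0.2, 0.8] if row[0] > 3 else [0.8, 0.2] for row in X]

c_sketch = TinyPipeline(LengthVectorizer(), ThresholdModel())
print(c_sketch.predict_proba(['one two', 'a method for process automation']))
```

The real pipeline `c` works the same way, which is why LIME can hand it raw strings directly.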

Now we create an explainer object. We pass class_names as an argument for prettier display.

In [18]:
from lime.lime_text import LimeTextExplainer
In [19]:
class_names = ['705','706']
explainer = LimeTextExplainer(class_names=class_names)

We then generate an explanation with at most 6 features for an arbitrary document in the test set.

In [20]:
idx = 83
exp = explainer.explain_instance(X_test.iloc[idx], c.predict_proba, num_features=6)
print('Document id: %d' % idx)
print('Probability(706) =', c.predict_proba([X_test.iloc[idx]])[0,1])
print('True class: %s' % y_test.iloc[idx])
/opt/conda/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)
Document id: 83
Probability(706) = 0.616
True class: 706
In [21]:
print(y_test.iloc[idx])
706
In [22]:
print(X_test.iloc[idx])
1. A method for process automation comprising: monitoring one or more workstations including monitoring screen contents and user actions at the workstations by executing a screen scraper module to obtain a dynamically updated current set of character and graphical information from screens of the workstations that includes user-entered data and retrieved screen data;  analyzing the current set to identify monitored functional events;  defining focal states as sequences of functional events wherein the current set comprises time intervals associated with the user actions respectively and the sequences of functional events of at least a portion of the focal states include the time intervals;  generating one or more facilitating scripts associated with respective ones of the focal states wherein the facilitating scripts each provide one or more automatic actions;  matching a sequence of the monitored functional events to the sequence of functional events of one of the focal states; and  applying the one or more automatic actions of the facilitating script associated with the one focal state;  wherein said steps are implemented in either:  computer hardware configured to perform said steps or computer software embodied in a non-transitory tangible computer-readable storage medium.                   2. The method as claimed in  claim 1  wherein matching a sequence of monitored functional events to a focal state includes fuzzy matching.                  3. The method as claimed in  claim 1  including requesting a user selection of one of multiple facilitating scripts associated with a matched focal state.                  4. The method as claimed in  claim 1  wherein providing multiple focal states and facilitating scripts includes providing pre-defined focal states and facilitating scripts as input by an administrator.                  5. 
The method as claimed in  claim 1  wherein providing multiple focal states includes: analyzing monitored screen contents and user actions to determine repeated functional events; and  automatically defining a focal state as a sequence of the monitored functional events.                   6. The method as claimed in  claim 5  wherein providing a facilitating script for a focal state includes: analyzing monitored user actions after the functional events of a focal state; and  automatically defining a facilitating script with actions corresponding to the monitored user actions.                   7. The method as claimed in  claim 1  including adapting a facilitating script including: evaluating a cost function of a system to which a facilitating script is applied; and  adapting a facilitating script to improve the cost function.                   8. The method as claimed in  claim 7  including iterating the steps of evaluating and adapting to optimize the facilitating script.                  9. The method as claimed in  claim 1  including monitoring multiple workstations involved in a single business process.                  10. 
A computer program product for process automation the computer program product comprising: a non-transitory computer readable storage medium in which computer program instructions are stored which instructions when executed by a computer cause the computer to perform the steps of:  monitoring one or more workstations including monitoring screen contents and user actions at the workstations by executing a screen scraper module to obtain a dynamically updated current set of character and graphical information from screens of the workstations that includes user-entered data and retrieved screen data;  analyzing the current set to identify monitored functional events;  defining focal states as sequences of functional events wherein the current set comprises time intervals associated with the user actions respectively and the sequences of functional events of at least a portion of the focal states include the time intervals;  generating one or more facilitating scripts associated with respective ones of the focal states wherein the facilitating scripts each provide one or more automatic actions;  matching a sequence of the monitored functional events to the sequence of functional events of one of the focal states; and  applying the one or more automatic actions of the facilitating script associated with the one focal state.                   11. 
A system for process automation comprising: a processor comprising:  a screen scraper module;  a monitoring agent for monitoring one or more workstations including monitoring screen contents and user actions at the workstations by executing the screen scraper module to obtain a dynamically updated current set of character and graphical information from screens of the workstations that includes user-entered data and retrieved screen data;  a current set module for analyzing the current set to identify monitored functional events;  a focal state provider for defining multiple focal states as sequences of functional events and a facilitating script provider providing one or more facilitating scripts associated with respective ones of the focal states wherein the current set comprises time intervals associated with the user actions respectively and the sequences of functional events of at least a portion of the focal states include the time intervals and wherein the facilitating scripts each provide one or more automatic actions;  a matching module for matching a sequence of monitored functional events to the sequence of functional events of one of the focal states; and  an applying module for applying the one or more automatic actions of the facilitating script associated with the one focal state.                   12. The system as claimed in  claim 11  wherein the matching module applies fuzzy matching of a sequence of monitored functional events to a focal state.                  13. The system as claimed in  claim 11  including a user input mechanism for providing user confirmation before applying the one or more automatic actions of a facilitating script.                  14. The system as claimed in  claim 11  including a user input mechanism for user selection of one of multiple facilitating scripts associated with a matched focal state.                  15. 
The system as claimed in  claim 11  wherein the focal state provider and the facilitating state provider provide pre-defined focal states and associated facilitating scripts as defined by an administrator.                  16. The system as claimed in  claim 11  wherein the focal state provider: analyses monitored screen contents and user actions to determine repeated functional events; and  automatically defines a focal state as a sequence of the monitored functional events.                   17. The system as claimed in  claim 16  wherein the facilitating script provider: analyses monitored user actions after the functional events of a focal state; and  automatically defines a facilitating script with actions corresponding to the monitored user actions.                   18. The system as claimed in  claim 11  including an adaptive mechanism to adapt facilitating scripts including: a cost function evaluation module for evaluating the cost to the system of applying a facilitating script;  an adaptive algorithm module; and  a perturbation mechanism for changing the facilitating script to improve the cost function.                   19. The system as claimed in  claim 18  wherein the adaptive mechanism iterates the adaptation of the facilitating script to optimize the facilitating script for the system.                  20. The method as claimed in  claim 11  wherein the monitoring agent monitors multiple workstations involved in a single business process.                  21. The method according to  claim 1  wherein analyzing the current set comprises analyzing a hierarchical structure of the screen contents.                  22. The method according to  claim 1  further comprising the step of executing a web scraper to obtain web content and wherein the current set includes the web content.                  23. 
The computer program product according to  claim 10  wherein analyzing the current set comprises analyzing a hierarchical structure of the screen contents.                  24. The system according to  claim 11  wherein analyzing the current set comprises analyzing a hierarchical structure of the screen contents.
In [23]:
exp.as_list()
Out[23]:
[('provider', -0.069503387429704036),
 ('state', 0.053083815917770921),
 ('transitory', -0.04938892781388856),
 ('states', 0.037563641572940232),
 ('set', 0.03555208942129659),
 ('claimed', -0.026522554197957238)]

From the LIME docs: "These weighted features are a linear model, which approximates the behaviour of the random forest classifier in the vicinity of the test example." Roughly, then, if we remove 'provider' and 'state' from the document, the prediction should move by about the sum of their weights. (The docs illustrate this on 20 newsgroups, where removing 'Posting' and 'Host' moves the prediction towards the Christianity class by about 0.27.) Let's see if this is the case.

Let's try it:

In [24]:
print('Original prediction:', rf.predict_proba(test_vectors[idx])[0,1])
tmp = test_vectors[idx].copy()
tmp[0,vectorizer.vocabulary_['provider']] = 0
tmp[0,vectorizer.vocabulary_['state']] = 0
print('Prediction removing some features:', rf.predict_proba(tmp)[0,1])
print('Difference:', rf.predict_proba(tmp)[0,1] - rf.predict_proba(test_vectors[idx])[0,1])
Original prediction: 0.616
Prediction removing some features: 0.636
Difference: 0.02

Removing those two features only moved the prediction by 0.02, so individually they were not worth much. I may need to do more word removal or work with n-grams.

Visualizing explanations

The explanations can be returned as a matplotlib barplot:

In [25]:
fig = exp.as_pyplot_figure()

The explanations can also be exported as an html page (which we can render here in this notebook), using D3.js to render graphs.

In [26]:
exp.show_in_notebook(text=False)

Alternatively, we can save the fully contained html page to a file:

In [27]:
exp.save_to_file('/tmp/oi_claim_stopwords_removed.html')

Finally, we can also include a visualization of the original document, with the words in the explanations highlighted. Notice where in the claim text the most influential words appear.

In [28]:
exp.show_in_notebook(text=True)

This explainer works for any classifier you may want to use, as long as it implements predict_proba.